Abstract A complete periodic star extraction and classification scheme is set up
and tested with the Hipparcos catalogue. The efficiency of each step is derived by
comparing the results with prior knowledge coming from the catalogue or from the
literature. A combination of two variability criteria is applied in the first step to
select 17 006 variability candidates from a complete sample of 115 152 stars. Our
candidate sample turns out to include 10 406 known variables (i.e., 90% of the total
of 11 597) and 6600 contaminating constant stars. A random forest classification is
used in the second step to extract 1881 (82%) of the known periodic objects while
entirely removing constant stars from the sample and limiting the contamination of
non-periodic variables to 152 stars (7.5%). The confusion introduced by these 152
non-periodic variables is evaluated in the third step using the results of the Hipparcos
periodic star classification presented in a previous study (Dubath et al. [1]).



1 Introduction

Current and forthcoming photometric surveys are monitoring very large numbers
of astronomical targets, providing a fantastic ocean for fishing interesting variable
objects. However, because of the large numbers involved, their extraction requires
the use of fully automated and efficient data mining techniques. In this contribution,
we use the Hipparcos data set to investigate the performance of a complete and
automated scheme for the identification and the classification of periodic variables. As
shown in Fig. 1, we study a three-step process. In the first step, variable candidates
are separated from the objects most likely to be constant. This saves significant
processing time, as the period search is performed only in the subset of variable
candidates. The validity of the detected periods is established in the second step,
which separates truly periodic from non-periodic objects. The third step is the
classification of periodic variables into a list of types (only a subset of them is shown
in Fig. 1). This step is presented in detail in Dubath et al. [1]. To avoid unnecessary
repetition, we refer to that paper for a full description of the calculation of the
classification attributes and of the details of the random forest methodology.



[Figure 1: flow chart with boxes "Sample of Surveyed Stars", "1. Variability detection",
"Variable Stars (10 to 30%)", "2. Periodicity detection", "Non-Periodic Stars"]

Fig. 1 Illustration of the steps used in this study to identify and classify variable sources

This three-step organisation represents a particular option. Alternatives are also
being considered, but they are outside the scope of this contribution, as is the
classification of non-periodic variables, which is the subject of another study (Rimoldini
et al., in preparation).



2 Variability Detection

In order to select variable star candidates, a number of variability criteria are
computed from the Hipparcos light curves¹. These criteria all characterize, in one way
or another, an excess of scatter compared to that expected from random noise.
Some of them rely on the noise estimates while others do not. A p-value is computed
for each of the tests, and a star is accepted as a variable candidate if its p-value
is smaller than a specified threshold.



¹ Only data points with quality flags 0 and 1 have been used in the light curves, and stars
whose light curves have fewer than 5 good data points are discarded.
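
As an illustration of how such a p-value can be obtained, the sketch below implements a chi-square test of a light curve against a constant (weighted mean) magnitude. It is a minimal sketch and not the pipeline used in this study: the function names, the choice of n - 1 degrees of freedom and the pure-Python incomplete gamma routine are assumptions made for the example.

```python
import math

def upper_gamma_q(a, x):
    """Regularized upper incomplete gamma function Q(a, x)."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of P(a, x); Q = 1 - P.
        term = 1.0 / a
        total = term
        n = 0
        while abs(term) > 1e-16 * abs(total):
            n += 1
            term *= x / (a + n)
            total += term
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for Q(a, x).
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(log_prefactor)

def chi2_pvalue(mags, errs):
    """P-value of the hypothesis 'constant star' (assumed dof = n - 1)."""
    weights = [1.0 / e ** 2 for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(((m - mean) / e) ** 2 for m, e in zip(mags, errs))
    dof = len(mags) - 1
    return upper_gamma_q(dof / 2.0, chi2 / 2.0)

# Hypothetical light curves (not Hipparcos data), errors of 0.02 mag.
quiet = [10.0, 10.01, 9.99, 10.0, 10.02, 9.98]
active = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
p_quiet = chi2_pvalue(quiet, [0.02] * 6)
p_active = chi2_pvalue(active, [0.02] * 6)
```

A star would then be kept as a variable candidate when its p-value falls below the chosen threshold. If the quoted errors are underestimated, the chi-square value is inflated and constant stars obtain small p-values.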
[Figure 2: two panels, "Chi2 Criterion" (left) and "Stetson Criterion" (right)]

Fig. 2 Number of stars selected using the chi-square criterion (left) and the Stetson [2] one (right)
as a function of the p-value threshold. The total numbers of selected stars are displayed in black,
while the numbers of periodic and variable stars among them are shown in magenta and blue,
respectively. The numbers of variable stars drawn in blue include the contribution of periodic and
non-periodic objects. The difference between the blue and the black curves indicates the amount
of contamination from non-variable stars. The two horizontal lines indicate the total number of
periodic stars (2672 in magenta) and of variable stars (11 453 in blue). The complete sample
includes 115 152 stars.



Figure 2 shows the number of selected sources as a function of the p-value threshold
obtained from a chi-square criterion in the left panel and from an alternative
criterion proposed by Stetson [2] in the right one.

As expected, the number of selected stars increases with larger p-value thresholds
in both panels. The optimum threshold maximizes the number of selected true
variables while limiting the contamination by false positives (i.e., by constant stars).
The chi-square criterion is efficient at finding variables, but it also includes a large
number of false positives, even when the threshold is extremely small. This suggests
that Hipparcos photometric errors may be slightly underestimated. The Stetson
criterion is quite efficient for periodic variables and limits the number of false
positive detections better, but it misses more non-periodic variable stars.
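
The trade-off described above can be tabulated by scanning the threshold and counting, at each value, the selected stars, the known variables among them and the contaminating constant stars. The following is a minimal sketch with made-up p-values and flags; `scan_thresholds` is a hypothetical helper, not part of the study.

```python
def scan_thresholds(pvalues, is_variable, thresholds):
    """Return (threshold, n_selected, n_true_pos, n_false_pos) rows."""
    rows = []
    for threshold in thresholds:
        # A star is selected when its p-value is below the threshold.
        selected = [flag for p, flag in zip(pvalues, is_variable) if p < threshold]
        n_true = sum(selected)
        rows.append((threshold, len(selected), n_true, len(selected) - n_true))
    return rows

# Hypothetical p-values and "known variable" flags for six stars.
pvalues = [1e-12, 1e-8, 1e-4, 0.02, 0.3, 0.8]
is_variable = [True, True, True, False, True, False]
table = scan_thresholds(pvalues, is_variable, [1e-10, 1e-3, 0.05])
for row in table:
    print(row)
```

Plotting the second and third columns against the first gives curves of the kind shown in Fig. 2, with the gap between them measuring the contamination.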

Figure 3 shows a comparison of the numbers of stars selected using different
variability criteria, each tested with a particular near-optimum p-value threshold. The
variability criteria tested include (1) the chi-square criterion, (2) the skewness and (3) the
kurtosis of the magnitude distributions, (4) the Abbe criterion (e.g., see Strunov [3]),
(5) the interquartile range, (6) the Stetson criterion, (7) the outlier median criterion
and (8) the union of the Stetson and interquartile criteria.
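
Two of the statistics listed above are simple enough to sketch directly. The example below computes the Abbe value in one common form (half the ratio of the mean squared successive difference to the variance) and the interquartile range. The exact normalisations and the conversion to p-values used in this study are not reproduced here, and the function names are assumptions.

```python
import statistics

def abbe_value(mags):
    """Abbe value of a time-ordered series: ~1 for uncorrelated noise,
    well below 1 for smooth trends."""
    n = len(mags)
    mean = sum(mags) / n
    successive = sum((b - a) ** 2 for a, b in zip(mags, mags[1:]))
    spread = sum((m - mean) ** 2 for m in mags)
    return n / (2.0 * (n - 1)) * successive / spread

def interquartile_range(mags):
    """Difference between the third and first quartiles."""
    q1, _, q3 = statistics.quantiles(mags, n=4)
    return q3 - q1

# Hypothetical series: a smooth ramp and a rapidly alternating signal.
ramp = [float(i) for i in range(10)]
alternating = [float(i % 2) for i in range(10)]
abbe_ramp = abbe_value(ramp)
abbe_alternating = abbe_value(alternating)
iqr_ramp = interquartile_range(ramp)
```

The Abbe value is sensitive to the time ordering of the measurements, whereas the interquartile range only depends on the magnitude distribution, so the two respond to different kinds of variability.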

[Figure 3: bar chart with legend "# Candidates", "True Pos", "False Pos", "Const flagged Var"]

Fig. 3 Comparison of the numbers of stars selected using different variability criteria and a
particular near-optimum p-value threshold. The total numbers of variable candidates appear in blue,
red bars show the number of candidates flagged in the Hipparcos catalogue as variables (i.e., the
true positive detections), and yellow bars indicate the number of false positives. The fraction of
false positives flagged as "constant" in the Hipparcos catalogue is shown in green.



This figure shows again that the chi-square is the most efficient criterion at
identifying variable stars, but that it also includes the largest contribution of false positive
detections. A final sample of 17 006 variable candidates (i.e., 14.8% of the total) is
formed by merging the Stetson and the interquartile selections obtained with p-value
thresholds of 10^^ and 10^^, respectively. This sample is used in the subsequent
steps of this study.
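
Merging two selections amounts to a set union of the stars passing either criterion. The sketch below uses made-up p-values, and its two thresholds are illustrative placeholders rather than the values adopted in this study. The last lines only re-derive the fractions quoted in the text from the published counts.

```python
def merge_selections(stetson_p, iqr_p, stetson_threshold, iqr_threshold):
    """Union of the stars selected by either criterion."""
    by_stetson = {star for star, p in stetson_p.items() if p < stetson_threshold}
    by_iqr = {star for star, p in iqr_p.items() if p < iqr_threshold}
    return by_stetson | by_iqr

# Hypothetical p-values keyed by star identifier (not Hipparcos data).
stetson_p = {"A": 1e-9, "B": 0.2, "C": 1e-5, "D": 0.7}
iqr_p = {"A": 1e-7, "B": 1e-8, "C": 0.4, "D": 0.9}
# Placeholder thresholds, chosen only for this example.
candidates = merge_selections(stetson_p, iqr_p, 1e-4, 1e-6)

# Counts quoted in the text.
n_total, n_candidates = 115152, 17006
n_known_variables, n_constant = 10406, 6600
candidate_fraction = n_candidates / n_total           # about 14.8%
recovered_fraction = n_known_variables / 11597        # about 90%
```

Using a union keeps a star that is flagged by only one of the two criteria, which raises the completeness at the price of adding the false positives of both selections.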



3 Periodicity Detection 



Fig. 1 indicates that periodicity detection is the second step. This figure might however
be misleading, as it assumes that the first step is perfect. In reality, the second
step starts with a sample of variable star candidates, which includes a number of
constant objects. With the knowledge of the Hipparcos catalogue², we know quite
precisely what mixture of stars is included in our selection. Of the 17 006 candidates,
(1) 2657 stars are flagged as periodic in the Hipparcos catalogue (flag
H52), (2) 6954 as unsolved³, (3) 794 as micro-variable, (4) 762 as constant, and (5)
4360 are not flagged, because they were considered neither variable nor constant with
any degree of confidence⁴. These stars are used to train and test the performance of
a random forest supervised classifier for identifying periodic variables.



² http://www.rssd.esa.int/index.php?project=HIPPARCOS&page=Overview
³ Stars flagged as "Unsolved" have Hipparcos light curves from which it was not possible to derive
significant evidence for a period. They may include periodic stars with light curves of insufficient
quality or truly non-periodic sources.

Using the procedures and criteria described in section 3 of Dubath et al. [1], a
good period is obtained for 2323 of the 3022 stars with a known period included
in our 17 006 star sample (i.e., a good period recovery rate of 77%). There are 357
stars flagged as periodic for which a wrong period value is obtained. Those are
eliminated from our training set, as are the 20 "unsolved" stars for which a good
period value is found.

A large number of attributes are computed, and the procedure presented in
section 4 of Dubath et al. [1] is followed to rank and select the most important
attributes. Figure 4 displays the results of a series of ten experiments of 10-fold
cross-validation (CV), i.e., 100 experiments for each attribute number.
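
The bookkeeping behind "ten experiments of 10-fold CV" can be sketched as follows: each repetition reshuffles the sample and splits it into ten folds, giving 100 train/test splits in total. This is a minimal illustration with a hypothetical helper name and seed, and the classifier training itself is omitted.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_folds):
            # Every n_folds-th shuffled index goes to the current test fold.
            test = indices[fold::n_folds]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test

splits = list(repeated_kfold(53))
print(len(splits))  # 10 repetitions x 10 folds
```

Within one repetition every star is used exactly once for testing, so the spread of the error rate over the 100 splits reflects both the fold composition and the reshuffling.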



[Figure 4: CV error rate as a function of the number of classification attributes]

Fig. 4 Evolution of the Cross-Validation (CV) error rate as more and more attributes are added into
the classification process. In the different CV experiments (one hundred for each attribute number),
the exact list of attributes used to estimate the error rate is not always exactly the same (see Fig. 5).

Figure 4 shows that the three most important attributes already drive the mean
error down to 27%, which reduces to 20% with 7 attributes. The mean error continues
to decrease slowly until it reaches a plateau of 18.5%. Using more than about
15 attributes does not lead to further significant improvements.

Figure 5 displays the ranking of the most important attributes in the CV
experiments. Figs. 4 and 5 should be read together. While Fig. 4 shows that experiments
done with 3 attributes result in a mean error rate of 27%, Fig. 5 indicates that most
of the time the 3 most important attributes are those labelled (a), (b), and (c).
Numbers in this figure indicate the number of times that the attribute has a particular
rank in the series of 100 experiments.



⁴ 662 stars flagged as "R" (for revised color index) and 816 stars flagged as "D" (duplicity-induced
variability) in the Hipparcos catalogue are not included in our training set (see page 121 of the
Hipparcos catalogue).



[Figure 5: rank table of the attributes a. Stetson criterion; b. Log10(range); c. Normalized p2p
scatter; d. QSOvar criterion; e. Log10(QSO probability); f. Period search FAP; g. P2p scatter:
P/raw; h. 0.1 to 1 day variance]

Fig. 5 This figure shows the ranking of the most important attributes in the ten series of 10-fold
cross-validation

Below we provide a short description of the most important attributes displayed
in Fig. 5.

a Stetson criterion - Stetson variability index [2], pairing successive measurements
if separated by less than 0.05 days. This time interval is optimized to be long
enough to make many pairs while remaining much shorter than typical period
values.

b Log10(range) - Decadic log of the range of the raw time-series magnitudes.

c Normalized p2p scatter - Point-to-point scatter computed on the folded time
series, normalized by the mean of the square of the measurement errors.

d QSOvar criterion - Reduced χ² of the source variability with respect to a parametrized
quasar variance model (denoted by χ²_QSO/ν in Butler & Bloom [4]).

e Log10(QSO probability) - Log10 of a quantity defined by Eq. 8 in Butler &
Bloom [4].

f Period search FAP - False-alarm probability associated with the maximum peak
in the Lomb-Scargle periodogram.

g P2p scatter: P/raw - Point-to-point scatter from the folded time series, normalized by
the same quantity computed on the raw time series.

h Variance within 0.1 to 1 day intervals - The average of absolute magnitude
variations on time scales from 0.1 to 1 days.
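
Attributes (a) and (g) can be sketched in a few lines. The code below is a simplified illustration, not the implementation used in this study: the Stetson-type index uses equal weights and only pairs of successive measurements, and the point-to-point scatter is taken as the sum of squared successive differences. Both choices, and all function names, are assumptions. Measurements are assumed to be sorted in time.

```python
import math

def stetson_pair_index(times, mags, errs, max_gap=0.05):
    """Simplified Stetson-type index from pairs closer than max_gap days."""
    n = len(mags)
    weights = [1.0 / e ** 2 for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    scale = math.sqrt(n / (n - 1.0))
    delta = [scale * (m - mean) / e for m, e in zip(mags, errs)]
    terms = []
    for i in range(n - 1):
        if times[i + 1] - times[i] < max_gap:
            product = delta[i] * delta[i + 1]
            # Signed square root: correlated residuals add, noise cancels.
            terms.append(math.copysign(math.sqrt(abs(product)), product))
    return sum(terms) / len(terms) if terms else 0.0

def p2p_scatter(values):
    """Sum of squared differences between successive values."""
    return sum((b - a) ** 2 for a, b in zip(values, values[1:]))

def folded_to_raw_p2p(times, mags, period):
    """Point-to-point scatter of the folded curve over that of the raw one."""
    phases = [(t / period) % 1.0 for t in times]
    folded = [m for _, m in sorted(zip(phases, mags))]
    return p2p_scatter(folded) / p2p_scatter(mags)

# Hypothetical data: four close pairs of measurements, errors of 0.02 mag.
pair_times = [0.0, 0.01, 1.0, 1.01, 2.0, 2.01, 3.0, 3.01]
correlated = [10.2, 10.21, 9.8, 9.79, 10.1, 10.11, 9.9, 9.89]
anticorrelated = [10.01, 9.99, 9.99, 10.01, 10.01, 9.99, 9.99, 10.01]
j_variable = stetson_pair_index(pair_times, correlated, [0.02] * 8)
j_noise = stetson_pair_index(pair_times, anticorrelated, [0.02] * 8)

# Hypothetical sinusoid with a period of 1 day, irregularly sampled.
times = [0.37 * i + 0.013 * ((i * i) % 7) for i in range(60)]
mags = [math.sin(2.0 * math.pi * t) for t in times]
ratio = folded_to_raw_p2p(times, mags, period=1.0)
```

For a correct period the folded curve is smooth and the ratio is well below one, while a wrong period leaves the folded curve as scattered as the raw one. This is why attribute (g) helps separate valid from spurious periods.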

Figure 6 shows the confusion matrix obtained from the out-of-bag samples of a
2000-tree random forest classification. Out of the 2300 periodic stars with good
periods, 1881 (82%) are correctly identified while 419 (18%) are missed, mostly
appearing in the "unsolved" category. Remarkably, only 152 stars (134 unsolved,
9 micro-variables and 9 stars without flags) are wrongly classified as "periodic",
resulting in a total contamination of the periodic type of 7.5%. There is also no
confusion between the "constant" and "periodic" types.
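
The percentages quoted above follow from the counts in the "periodic" row and column of the confusion matrix. The short sketch below re-derives them, assuming the contamination is measured relative to all stars predicted as periodic. The variable names are chosen for the example.

```python
# Stars with a good period, split by predicted class.
true_periodic = {"periodic": 1881, "other": 419}
# Stars of other classes predicted as periodic.
false_periodic = {"unsolved": 134, "micro-variable": 9, "no flag": 9, "constant": 0}

n_true = sum(true_periodic.values())            # 2300 periodic stars
n_false = sum(false_periodic.values())          # 152 contaminating stars
n_predicted = true_periodic["periodic"] + n_false

recovery_rate = true_periodic["periodic"] / n_true   # about 82%
missed_rate = true_periodic["other"] / n_true        # about 18%
contamination = n_false / n_predicted                # about 7.5%
```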
Fig. 6 Confusion matrix obtained with the out-of-bag samples in a 2000-tree random forest
classification.

4 Impact on periodic star classification

The classification of the Hipparcos periodic variable stars is the subject of a previous
study (Dubath et al. [1]). The confusion matrix obtained in that study is displayed
in Fig. 7. This figure, however, paints an optimistic picture, as the sample only
contains the best known stars, for which we have relatively clean light curves. It
is very difficult to evaluate accurately the extent of the expected degradation when
using this model to classify other stars. Some indications can however be derived in
two different ways.

First, the classification model derived from the training set can be applied to the
sample of Hipparcos stars with uncertain types from the literature. The results of
this process are shown in Figs. 10 and 11 of Dubath et al. [1], where a relatively mild
confusion is observed and evaluated.

Second, the current study shows that any sample of periodic stars is expected to
be contaminated by non-periodic stars because of the imperfection of the two
preliminary steps, namely variability and periodicity detection. This contamination is
evaluated at about 7.5% in the previous section (see Fig. 6). The 152 stars wrongly
identified as periodic can be classified using a periodic classification model to evaluate
more precisely the contamination in terms of periodic types.

A 10-fold cross-validation experiment is carried out to extract the variables
wrongly classified as periodic: 150 stars, including 133 "Unsolved", 8 "Micro-variable"
and 9 with no flag. These numbers differ slightly from the corresponding
ones in Fig. 6 because of the randomness involved in random forest classification. The
classification model from Dubath et al. [1] is then used to predict periodic types for
these stars.
Fig. 7 Confusion matrix obtained by Dubath et al. [1] for the Hipparcos periodic variable stars.



The predicted types for the 150 stars turn out to include 125 Long-Period Variables
(LPVs), 9 RR Lyrae of type AB, 6 Delta Scuti, and eclipsing binaries of type
EA (5) and EB (5). The LPV classification predicted for 125 stars could easily be
understood if they had large amplitudes, red colors and long (most probably spurious)
periods, as expected for this kind of star. However, this is not supported by the data.
Understanding the true nature of these stars requires further investigation.